This vignette describes how to analyse mass-spectrometry based differential localisation experiments using the BANDLE method (Crook et al. 2021). Data should be stored as lists of MSnSets. There is also features for quality control and visualisation of results. Other vignettes are for convergence and other methodology.
bandle 1.0
Bayesian ANalysis of Differential Localisation Experiments (BANDLE) is an integrative semi-supervised functional mixture model, developed by Crook et al. (2021), to obtain the probability of a protein being differentially localised between two conditions.
In this vignette we walk users through how to install and use the R (R Development Core Team 2011)
Bioconductor (Gentleman et al. 2004) bandle package
by simulating a well-defined differential localisation experiment from spatial
proteomics data from the pRolocdata package (Gatto et al. 2014).
The BANDLE method uses posterior Bayesian computations performed using Markov-chain Monte-Carlo (MCMC) and thus uncertainty estimates are available (Gilks, Richardson, and Spiegelhalter 1995). Throughout this vignette we use the term differentially localised to pertain to proteins which are assigned to different sub-cellular localisations between two conditions. One of the main outputs of BANDLE is the probability that a protein is differentially localised between two conditions.
In this vignette and Crook et al. (2021), the main data source that we use to study
differential protein sub-cellular localisation are data from high-throughput
mass spectrometry-based experiments. The data from these types of experiments
traditionally yield a matrix of measurements wherein we have, for example, PSMs,
peptides or proteins along the rows, and samples/channels/fractions along the
columns. The bandle package uses the MSnSet class as implemented in the
Bioconductor MSnbase package and thus requires users to import
and store their data as a MSnSet instance. For more details on how to create a
MSnSet see the relevant vignettes in pRoloc. There is also
additional information and examples in the pRoloc sister package.
The pRolocdata experiment data package is a good starting
place to look for test data. This data package contains tens of quantitative
proteomics experiments, stored as MSnSets. In the next section we load some
real data from this package as a use-case to demonstrate how to run the bandle
package.
The data used in this vignette has been published in Mulvey et al. (2021) and is currently stored as
MSnSet instances in the development version of pRolocdata package. For
convenience the data is also stored in this package until it is available in the
next stable Bioconductor release.
Briefly, Mulvey et al. (2021) performed triplicate hyperLOPIT experiments on THP-1 human leukamia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).
In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples.
data("thpLOPIT_unstimulated_rep1_mulvey2021")
data("thpLOPIT_unstimulated_rep3_mulvey2021")
data("thpLOPIT_lps_rep1_mulvey2021")
data("thpLOPIT_lps_rep3_mulvey2021")
By typing the names of the datasets we get a dataMSnSet` data summary. For
example,
thpLOPIT_unstimulated_rep1_mulvey2021
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5107 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ...
## unstim_rep1_set2_131_F24 (20 total)
## varLabels: Tag Treatment ... Fraction (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total)
## fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:48 2021.
## Normalised to sum of intensities.
## MSnbase version: 2.14.2
thpLOPIT_lps_rep1_mulvey2021
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4879 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ...
## lps_rep1_set2_131_F25 (20 total)
## varLabels: Tag Treatment ... Fraction (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total)
## fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:57 2021.
## Normalised to sum of intensities.
## MSnbase version: 2.14.2
We see that the datasets thpLOPIT_unstimulated_rep1_mulvey2021 and
thpLOPIT_lps_rep1_mulvey2021 contain 5107 and 4879 proteins respectively,
across 20 TMT channels. The data is accessed through different slots of the
MSnSet (see str(thpLOPIT_unstimulated_rep1_mulvey2021) for all available
slots). The 3 main slots which are used most frequently are those that contain
the quantitation data, the features i.e. PSM/peptide/protein information and the
sample information, and these can be accessed using the functions exprs,
fData, and pData, respectively.
To run bandle there are a few minimal requirements that the data must fulfil.
Data are required to have
- the same number of channels across conditions and replicates
- the same proteins across conditons and replicates
- data must be a list of MSnSet instances
If we use the dim function we see that the datasets we have loaded have the
same number of channels but a different number of proteins per experiment.
dim(thpLOPIT_unstimulated_rep1_mulvey2021)
## [1] 5107 20
dim(thpLOPIT_unstimulated_rep3_mulvey2021)
## [1] 5733 20
dim(thpLOPIT_lps_rep1_mulvey2021)
## [1] 4879 20
dim(thpLOPIT_lps_rep3_mulvey2021)
## [1] 5848 20
We use the function commonFeatureNames to extract proteins that are common
across all replicates. This function has a nice side effect which is that it
also wraps the data into a list, ready for input into bandle.
thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021, ## unstimulated rep
thpLOPIT_unstimulated_rep3_mulvey2021, ## unstimulated rep
thpLOPIT_lps_rep1_mulvey2021, ## 12h-LPS rep
thpLOPIT_lps_rep3_mulvey2021)) ## 12h-LPS rep
## 3727 features in common
We now have our list of MSnSets ready for bandle with 3727 proteins common
across all 4 replicates/conditions.
thplopit
## Instance of class 'MSnSetList' containig 4 objects.
We can visualise the data using the plot2D function from pRoloc
## create a character vector of title names for the plots
plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep",
"12h-LPS 1st rep", "12h-LPS 2nd rep")
## plot the data
par(mfrow = c(2,2))
for (i in seq(thplopit))
plot2D(thplopit[[i]], main = plot_id[i])
addLegend(thplopit[[4]], where = "topleft", cex = .75)
bandle functionThe main function of the bandle package is bandle, this uses a complex
model to analyse the data. Markov-Chain Monte-Carlo (MCMC) is used to
sample the posterior distribution of parameters and latent variables
from which statistics of interest can be computed. Here we only run a few iterations
for brevity but typically one needs to run thousands of iterations to ensure
convergence, as well as multiple parallel chains.
First, we need to fit non-parametric regression functions to the markers
profiles, upon which we place our analysis. This uses Gaussian processes. We use
the fitGPmaternPC function and the fitting uses some default penalised
complexity priors (see ?fitGP), which work well. However, these can be
altered, which is demonstrated in the next code chunk
par(mfrow = c(3,4))
gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))
We apply the fitGPmaternPC function on every MSnSet experiment by calling
lapply over the thplopit list of data. The output of fitGPmaternPC returns
a list of posterior predictive means and standard deviations. As well as MAP
hyperparamters for the GP. As a side effect the posterior predictive
distributions are overlayed with markers protein profiles for each subcellular
class.
The prior needs to form a K*3 matrix (where K is the number of subcellular
classes in the data), and 3 for (1) the prior, (2) length-scale amplitude and
(3) standard deviation parameters (see hyppar in ?fitGP). Increasing these
values, increases the shrinkage. For more details see the manuscript by Crook et al. (2021).
K <- length(getMarkerClasses(thplopit[[1]], fcol = "markers"))
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(10, 60, 250),
each = K), ncol = 3)
Now we have generated these complexity priors we can pass them as an
argument to the fitGPmaternPC function. For example,
par(mfrow = c(3,4))
gpParams <- lapply(thplopit,
function(x) fitGPmaternPC(x, hyppar = pc_prior))
By looking at the plot of posterior predictives we can see the GP fit looks sensible.
The next step is to set up the matrix Dirichlet prior on the mixing weights. These
weights are defined across datasets so these are slightly different to mixture
weights in usual mixture models. The \((i,j)^{th}\) entry is the prior probability
that a protein localises to organelle \(i\) in the control and \(j\) in the treatment.
This mean that off-diagonal terms have a different interpretation to diagonal terms.
Since we expect relocalisation to be rare, off-diagonal terms should be small.
The following functions help set up the priors and how to interpret them. The
parameter q allow us to check the prior probability that more than q
differential localisations are expected.
set.seed(1)
dirPrior = diag(rep(1, K)) + matrix(0.0005, nrow = K, ncol = K)
predDirPrior <- prior_pred_dir(object = thplopit[[1]],
dirPrior = dirPrior,
q = 15)
The mean number of relocalisation is small:
predDirPrior$meannotAlloc
## [1] 0.2308647
The prior probability that more than q differential localisations are
expected is small
predDirPrior$tailnotAlloc
## [1] 6e-04
The full prior predictive can be visualised as histogram. The prior probability that proteins are allocated to different components between datasets concentrates around 0.
hist(predDirPrior$priornotAlloc, col = getStockcol()[1])
We are now ready to run the main bandle function. Remember to carefully
select the datasets and replicates that define the control and treatment. Here
for convenience of building the vignette we only run 2 of the triplicates for
each condition and run the bandle function for a small number of iterations to minimise
the vignette build-time. Typically we’d recommend you run the number of
iterations (numIter) in the \(1000\)s.
control <- list(thplopit[[1]], thplopit[[2]])
treatment <- list(thplopit[[3]], thplopit[[4]])
bandleres <- bandle(objectCond1 = control,
objectCond2 = treatment,
numIter = 50, # usually 10,000
burnin = 5, # usually 5,000
thin = 1, # usually 20
gpParams = gpParams,
pcPrior = pc_prior,
numChains = 1, # usually >=4
dirPrior = dirPrior)
##
|
| | 0%
|
|= | 2%
|
|=== | 4%
|
|==== | 6%
|
|====== | 8%
|
|======= | 10%
|
|======== | 12%
|
|========== | 14%
|
|=========== | 16%
|
|============= | 18%
|
|============== | 20%
|
|=============== | 22%
|
|================= | 24%
|
|================== | 26%
|
|==================== | 28%
|
|===================== | 30%
|
|====================== | 32%
|
|======================== | 34%
|
|========================= | 36%
|
|=========================== | 38%
|
|============================ | 40%
|
|============================= | 42%
|
|=============================== | 44%
|
|================================ | 46%
|
|================================== | 48%
|
|=================================== | 50%
|
|==================================== | 52%
|
|====================================== | 54%
|
|======================================= | 56%
|
|========================================= | 58%
|
|========================================== | 60%
|
|=========================================== | 62%
|
|============================================= | 64%
|
|============================================== | 66%
|
|================================================ | 68%
|
|================================================= | 70%
|
|================================================== | 72%
|
|==================================================== | 74%
|
|===================================================== | 76%
|
|======================================================= | 78%
|
|======================================================== | 80%
|
|========================================================= | 82%
|
|=========================================================== | 84%
|
|============================================================ | 86%
|
|============================================================== | 88%
|
|=============================================================== | 90%
|
|================================================================ | 92%
|
|================================================================== | 94%
|
|=================================================================== | 96%
The bandle function generates an object of class bandleParams. The show
method indicates the number of parallel chains that were run, this should
typically be greater than 4 (here we use 1 just as a demo).
bandleres
## Object of class "bandleParams"
## Method: bandle
## Number of chains: 1
bandle outputBefore we can begin to extract protein allocation information and a list of
proteins which are differentially localised between conditions, we first need to
populate the bandleres object by calling the bandleProcess function.
bandleres objectCurrently, the summary slots of the bandleres object are empty. The
summaries function accesses them.
summaries(bandleres)
## [[1]]
## An object of class "bandleSummary"
## Slot "posteriorEstimates":
## DataFrame with 0 rows and 0 columns
##
## Slot "diagnostics":
## <0 x 0 matrix>
##
## Slot "bandle.joint":
## <0 x 0 matrix>
##
##
## [[2]]
## An object of class "bandleSummary"
## Slot "posteriorEstimates":
## DataFrame with 0 rows and 0 columns
##
## Slot "diagnostics":
## <0 x 0 matrix>
##
## Slot "bandle.joint":
## <0 x 0 matrix>
These can be populated as follows
bandleres <- bandleProcess(bandleres)
These slot have now been populated
summary(summaries(bandleres))
## Length Class Mode
## [1,] 1 bandleSummary S4
## [2,] 1 bandleSummary S4
The posteriorEstimates slot gives posterior quantities of interest for
different proteins. The object is of length 2, 1 for control and 1 for
treatment.
length(summaries(bandleres))
## [1] 2
One quantity of interest is the protein allocations, which we can plot in a
barplot. We create a new object from populating the bandleres objects using
the summaries function.
par(mfrow = c(1, 2), oma = c(6,2,2,2))
pe1 <- summaries(bandleres)[[1]]@posteriorEstimates
pe2 <- summaries(bandleres)[[2]]@posteriorEstimates
barplot(table(pe1$bandle.allocation), col = getStockcol()[2], las = 2, main = "Protein allocation: control")
barplot(table(pe2$bandle.allocation), col = getStockcol()[2], las = 2, main = "Protein allocation: treatment")
The bar plot tells us for this case study that bandle has allocated the
majority of unlabelled proteins to the nucleus (irrespective of the posterior
probability).
As previously mentioned the term “differentially localised” is used to pertain to proteins which are assigned to different sub-cellular localisations between two conditions. For the majority of users this is the output they are keen to extract using the BANDLE method.
Following on from the above example, after extracting posterior estimates for
each condition using the summaries function we can also access the
differential localisation probability as it is also stored here. The
differential localisation probability tells us which proteins are most likely to
differentially localise. The rank plot is a good visualisation. Indicating
that most proteins are not differentially localised and there are a few
confident differentially localised protiens of interest.
diffloc_probs <- pe1$bandle.differential.localisation
plot(diffloc_probs[order(diffloc_probs, decreasing = TRUE)],
col = getStockcol()[3], pch = 19, ylab = "Probability",
xlab = "Rank", main = "Differential localisation rank plot")
If we take a 1% FDR and examine how many proteins get a differential probability greater than 0.99 we find there are 626 proteins above this threshold.
length(which(diffloc_probs[order(diffloc_probs, decreasing = TRUE)] > 0.99))
## [1] 625
bootstrapdiffLocprob functionWe can examine the top n proteins (here top = 100) and produce bootstrap estimates
of the uncertainty (note here the uncertainty is likely to be underestimated
as we did not produce many MCMC samples). These can be visualised as ranked
boxplots
set.seed(1)
bt <- bootstrapdiffLocprob(params = bandleres, top = 100)
boxplot(t(bt), col = getStockcol()[5],
las = 2, ylab = "Probability", ylim = c(0, 1),
main = "Differential localisation \nprobability plot (top 100 proteins)")
binomDiffLoc functionIn addition to the
We can visualise the changes in localisation between conditions on an alluvial plot using the plotTranslocations function
plotTranslocations(bandleres)
Or alternatively, on a chord (circos) diagram
plotTranslocations(bandleres, type = "chord")
The allocation probabilities are stored in the tagm.joint slot. These could
be visualised in a heatmap
bjoint_control <- summaries(bandleres)[[1]]@bandle.joint
pheatmap(bjoint_control, cluster_cols = FALSE, color = viridis(n = 25))
bjoint_treatment <- summaries(bandleres)[[2]]@bandle.joint
pheatmap(bjoint_treatment, cluster_cols = FALSE, color = viridis(n = 25))
–> –> –> –>
All software and respective versions used to produce this document are listed below.
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] viridis_0.6.0 viridisLite_0.4.0 pheatmap_1.0.12
## [4] pRolocdata_1.29.1 ggplot2_3.3.3 bandle_1.0
## [7] pRoloc_1.30.0 BiocParallel_1.24.1 MLInterfaces_1.70.0
## [10] cluster_2.1.1 annotate_1.68.0 XML_3.99-0.6
## [13] AnnotationDbi_1.52.0 IRanges_2.24.1 MSnbase_2.16.1
## [16] ProtGenerics_1.22.0 mzR_2.24.1 Rcpp_1.0.7
## [19] Biobase_2.50.0 S4Vectors_0.28.1 BiocGenerics_0.36.0
## [22] BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] circlize_0.4.12 BiocFileCache_1.14.0 plyr_1.8.6
## [4] splines_4.0.5 digest_0.6.27 foreach_1.5.1
## [7] htmltools_0.5.1.1 magick_2.7.3 gdata_2.18.0
## [10] ggalluvial_0.12.3 fansi_0.4.2 magrittr_2.0.1
## [13] memoise_2.0.0 doParallel_1.0.16 mixtools_1.2.0
## [16] limma_3.46.0 recipes_0.1.15 gower_0.2.2
## [19] askpass_1.1 lpSolve_5.6.15 prettyunits_1.1.1
## [22] colorspace_2.0-0 ggrepel_0.9.1 blob_1.2.1
## [25] rappdirs_0.3.3 xfun_0.22 dplyr_1.0.5
## [28] crayon_1.4.1 jsonlite_1.7.2 hexbin_1.28.2
## [31] impute_1.64.0 survival_3.2-10 iterators_1.0.13
## [34] glue_1.4.2 gtable_0.3.0 ipred_0.9-11
## [37] zlibbioc_1.36.0 kernlab_0.9-29 shape_1.4.5
## [40] scales_1.1.1 vsn_3.58.0 mvtnorm_1.1-1
## [43] DBI_1.1.1 xtable_1.8-4 progress_1.2.2
## [46] bit_4.0.4 proxy_0.4-25 mclust_5.4.7
## [49] preprocessCore_1.52.1 lbfgs_1.2.1 lava_1.6.9
## [52] prodlim_2019.11.13 sampling_2.9 httr_1.4.2
## [55] FNN_1.1.3 RColorBrewer_1.1-2 ellipsis_0.3.1
## [58] farver_2.1.0 pkgconfig_2.0.3 nnet_7.3-15
## [61] sass_0.3.1 dbplyr_2.1.1 utf8_1.2.1
## [64] caret_6.0-86 tidyselect_1.1.0 rlang_0.4.10
## [67] reshape2_1.4.4 munsell_0.5.0 tools_4.0.5
## [70] LaplacesDemon_16.1.4 cachem_1.0.4 generics_0.1.0
## [73] RSQLite_2.2.6 evaluate_0.14 stringr_1.4.0
## [76] fastmap_1.1.0 mzID_1.28.0 yaml_2.2.1
## [79] ModelMetrics_1.2.2.2 knitr_1.33 bit64_4.0.5
## [82] purrr_0.3.4 randomForest_4.6-14 dendextend_1.14.0
## [85] ncdf4_1.17 nlme_3.1-152 xml2_1.3.2
## [88] biomaRt_2.46.3 compiler_4.0.5 curl_4.3
## [91] e1071_1.7-6 affyio_1.60.0 tibble_3.1.0
## [94] bslib_0.2.4 stringi_1.5.3 highr_0.8
## [97] lattice_0.20-41 Matrix_1.3-2 vctrs_0.3.7
## [100] pillar_1.6.0 lifecycle_1.0.0 BiocManager_1.30.12
## [103] GlobalOptions_0.1.2 jquerylib_0.1.3 MALDIquant_1.19.3
## [106] data.table_1.14.0 R6_2.5.0 pcaMethods_1.82.0
## [109] affy_1.68.0 bookdown_0.21 gridExtra_2.3
## [112] codetools_0.2-18 MASS_7.3-53.1 gtools_3.8.2
## [115] assertthat_0.2.1 openssl_1.4.3 withr_2.4.1
## [118] hms_1.0.0 grid_4.0.5 rpart_4.1-15
## [121] timeDate_3043.102 tidyr_1.1.3 coda_0.19-4
## [124] class_7.3-18 rmarkdown_2.7 segmented_1.3-3
## [127] pROC_1.17.0.1 lubridate_1.7.10